San Fermín: Aggregating Large Data Sets Using a Binomial Swap Forest

Authors

  • Justin Cappos
  • John H. Hartman
Abstract

San Fermín is a system for aggregating large amounts of data from the nodes of large-scale distributed systems. Each San Fermín node individually computes the aggregated result by swapping data with other nodes to dynamically create its own binomial tree. Nodes that fall behind abort their trees, thereby reducing overhead. Having each node create its own binomial tree makes San Fermín highly resilient to failures and ensures that the internal nodes of the tree have high capacity, thereby reducing completion time. Compared to existing solutions, San Fermín handles large aggregations better, has higher completeness when nodes fail, computes the result faster, and scales better. We analyze the completion time, completeness, and overhead of San Fermín versus existing solutions using analytical models, simulation, and experimentation with a prototype built on a peer-to-peer system deployed on PlanetLab. Our evaluation shows that San Fermín is scalable both in the number of nodes and in the aggregated data size. San Fermín aggregates large amounts of data significantly faster than existing solutions: compared to SDIMS, an existing aggregation system, San Fermín computes a 1MB result from 100 PlanetLab nodes in 61–76% of the time and aggregates data from 2–6 times as many nodes. Even if 10% of the nodes fail during aggregation, San Fermín still includes the data from 97% of the nodes in the result and does so faster than the underlying peer-to-peer system recovers from failures.
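
The core swap pattern the abstract describes can be illustrated with a small, self-contained simulation in Python. This is a hypothetical sketch only: in each round a node exchanges its partial aggregate with the peer whose ID differs in one more bit, so the binomial tree it is building doubles every round. The paper's actual protocol runs over a DHT, matches successively longer ID prefixes, tolerates failures, and aborts lagging trees, none of which is modeled here.

    def binomial_swap_aggregate(values):
        """Simulate N nodes (N a power of two) aggregating by pairwise swaps."""
        n = len(values)
        assert n and n & (n - 1) == 0, "sketch assumes a power-of-two node count"
        partial = list(values)                 # partial[i] = node i's running aggregate
        round_no = 0
        while (1 << round_no) < n:
            bit = 1 << round_no
            updated = partial[:]
            for node in range(n):
                partner = node ^ bit           # partner differs in bit `round_no` of the ID
                updated[node] = partial[node] + partial[partner]   # swap and merge
            partial = updated
            round_no += 1
        return partial                         # every node now holds the full aggregate

    # Eight nodes contributing 1..8: after log2(8) = 3 swap rounds each holds 36.
    print(binomial_swap_aggregate([1, 2, 3, 4, 5, 6, 7, 8]))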

Similar resources

Estimation of Count Data using Bivariate Negative Binomial Regression Models

Abstract: The negative binomial regression model (NBR) is a popular approach for modeling overdispersed count data with covariates. Several parameterizations of NBR have been proposed, and the two best-known models, the negative binomial-1 regression model (NBR-1) and the negative binomial-2 regression model (NBR-2), have been applied. Another parameterization of NBR is the negative binomial-P regression mode...

Full text
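
The abstract above is truncated before it reaches the parameterizations themselves. For orientation only, the conditional variance functions conventionally attached to these names in the count-data literature (not quoted from this paper) are:

    % Conditional mean under a log link, and the NB-P family of variance functions
    \mathbb{E}[Y_i \mid x_i] = \mu_i = \exp(x_i^{\top}\beta), \qquad
    \operatorname{Var}(Y_i \mid x_i) = \mu_i + \alpha\,\mu_i^{P}
    % P = 1 recovers NBR-1 (variance linear in the mean);
    % P = 2 recovers NBR-2 (variance quadratic in the mean).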

Comparison of Ordinal Response Modeling Methods like Decision Trees, Ordinal Forest and L1 Penalized Continuation Ratio Regression in High Dimensional Data

Background: Response variables in most medical and health-related research have an ordinal nature. Conventional modeling methods assume predictor variables are independent and require a large number of samples (n) relative to the number of covariates (p). It is therefore not possible to use conventional models for high-dimensional genetic data in which p > n. The present study compared th...

Full text

Implementation and Evaluation of Binary Swap Volume Rendering on a Commodity-Based Visualization Cluster

This paper describes the implementation and performance evaluation of a parallel volume renderer capable of handling large volumetric data sets. We implement our volume renderer on a cluster of 32 Linux PCs using OpenGL, MPI, and a binary-swap compositing algorithm. We also give hints for achieving good performance when using OpenGL and MPI on a Linux visualization cluster.

Full text
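
Binary-swap compositing, as used in the paper above, pairs renderers over log2(P) stages; at each stage a renderer hands half of its current image region to its partner and composites the half it keeps, so every renderer finishes owning a fully composited 1/P slice of the final image. The sketch below is a hypothetical stand-in: plain addition replaces the depth-ordered over operator, and direct array access replaces MPI communication.

    def binary_swap(partial_images):
        """Toy binary-swap over 1-D 'images': each of P renderers starts with a
        full-width partial image and finishes owning one composited width/P slice."""
        p = len(partial_images)
        width = len(partial_images[0])
        assert p and p & (p - 1) == 0, "sketch assumes a power-of-two renderer count"
        images = [img[:] for img in partial_images]   # full-width buffers
        regions = [(0, width)] * p                    # region each renderer still owns
        stage = 0
        while (1 << stage) < p:
            bit = 1 << stage
            next_images, next_regions = [], []
            for rank in range(p):
                partner = rank ^ bit                  # partners own identical regions
                lo, hi = regions[rank]
                mid = (lo + hi) // 2
                keep = (lo, mid) if rank & bit == 0 else (mid, hi)
                buf = images[rank][:]
                for i in range(*keep):                # composite only the kept half
                    buf[i] = images[rank][i] + images[partner][i]
                next_images.append(buf)
                next_regions.append(keep)
            images, regions = next_images, next_regions
            stage += 1
        return [(reg, img[reg[0]:reg[1]]) for reg, img in zip(regions, images)]

    # Four renderers, four-pixel partial images: each returned slice is the
    # pixel-wise sum of all four contributions over that slice.
    print(binary_swap([[1, 2, 3, 4], [1, 1, 1, 1], [2, 2, 2, 2], [0, 1, 0, 1]]))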

Assessing the efficiency of dye-swap normalization to remove systematic bias from two-color microarray data

Microarrays are a powerful tool in functional genomics, allowing simultaneous analysis of the expression levels of thousands of genes under different conditions. In order to compare measurements within and across arrays and to correct for non-biological variation that masks meaningful information, normalization is an essential task prior to any further analysis. Among all the available normalizatio...

Full text

Towards scaling up induction of second-order decision tables

One of the fundamental challenges for data mining is to enable inductive learning algorithms to operate on very large databases. Ensemble learning techniques such as bagging have been applied successfully to improve the accuracy of classification models by generating multiple models from replicate training sets and aggregating them to form a composite model. In this paper, we adapt the bagging ap...

Full text
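
The bagging procedure the abstract above refers to is easy to sketch: draw bootstrap replicates of the training set, fit one model per replicate, and aggregate predictions by majority vote. The threshold-stump learner below is a deliberately simple stand-in; the paper itself bags second-order decision tables, which are not reproduced here.

    import random
    from collections import Counter

    def fit_stump(xs, ys):
        """Pick the threshold on x that best separates the two labels."""
        best = None
        for t in sorted(set(xs)):
            preds = [1 if x > t else 0 for x in xs]
            acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
            if best is None or acc > best[0]:
                best = (acc, t)
        threshold = best[1]
        return lambda x: 1 if x > threshold else 0

    def bagging(xs, ys, n_models=25, seed=0):
        rng = random.Random(seed)
        models = []
        for _ in range(n_models):
            idx = [rng.randrange(len(xs)) for _ in range(len(xs))]  # bootstrap replicate
            models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
        # Composite model: majority vote over the ensemble
        return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

    clf = bagging([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
    print(clf(1.5), clf(5.5))   # expected: 0 1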

Journal:

Volume:   Issue:

Pages:  -

Publication date: 2008